Abstract

Stochastic models of evolution (Markov random fields on trivalent trees) generally assume
that different characters (different runs of the stochastic process) are independent and identically
distributed. In this paper we take the first steps towards addressing dependent characters.
Specifically we show that, under certain technical assumptions regarding the evolution of individual
characters, we can detect any significant, history-independent correlation between any
pair of multistate characters. For the special case of the Cavender-Farris-Neyman (CFN) model
on two states with symmetric transition matrices, our analysis needs milder assumptions. To
perform the analysis, we need to prove a new concentration result for multistate random variables
of a Markov random field on arbitrary trivalent trees: we show that the random variable
counting the number of leaves in any particular subset of states has variance that is subquadratic
in the number of leaves.



*Work done as a postdoctoral researcher at the Dept of Comp. and Inf. Science, University of Pennsylvania 



1 Introduction 



Estimating the phylogeny or evolutionary history of a set of organisms is an important problem in 
biology [HI [10]. The problem has the following general form: data is available about the species alive 
today, and the goal is to find a tree with these species at the leaves that "best explains" the data. 
Specific formulations of the problem are arrived at by specifying the type of data that is used and the 
notion of the tree that best explains this data. While early attempts at phylogeny construction used 
morphological characters, the data these days is derived by and large from biomolecular sequences 
such as protein and DNA sequences. Homologous sequences with presumed common evolutionary 
origin are observed for each of the species of interest. These sequences are aligned (as well as 
possible) in a multiple sequence alignment. Each position in such an alignment is called a character 
and the values taken on by these characters are called its states. Character-based methods such as 
parsimony seek a tree, and states at each internal node for each character, to minimize the number 
of state changes among the characters over all the edges of the tree. Distance-based methods 
convert this raw data into pairwise distances between species and seek an edge-weighted tree such 
that the input distances best fit the distances in the tree under various norms. 

Arguably, the most principled approach to phylogeny estimation is to select a family of stochastic 
models for character evolution, and find parameters for a model within this family that is most 
likely to have generated the data [9], [12] . Such stochastic models can be viewed as Markov random 
fields on rooted trivalent trees. The root has a state drawn from some distribution. Each node 
transmits its state to its two children via a Markovian process. For each character i, along a
parent-child edge e = (u, v), there is a transition matrix $M_{e,i}(a, b)$ that defines for each pair of
states a and b the probability that v is in state b given that u is in state a. When the character is clear
from the context we refer to this matrix as $M_e$, and when the edge is also clear we refer to it simply as M. The
unknown parameters are the shape of the tree and the transition matrices on each edge. The goal of 
statistical methods is to infer these parameters in order to maximize the probability of generating 
the given data. Many specific families of models have been considered. Among the simplest are 
two-state, symmetric models, called the Cavender-Farris-Neyman (CFN) [T71 [7\ [T] models, where 
on any edge e, each character has the same, symmetric transition matrix $M_e$.

Under these models, considerable work [8[R^[5lfT6 | ll5 1 [3] has been done to determine the number 
of characters needed to infer the phylogenetic tree under reasonable technical constraints on the 
transition matrices. Almost all of these works assume that the stochastic processes of the various 
characters are independent and identically distributed. This assumption might not be biologically 
realistic. Changes at one position of a DNA sequence or amino acid sequence are likely to be 
correlated with changes at other positions because of constraints on size, charge, hydrophobicity, etc. 
of the molecules involved. Thus we need to begin to understand how to infer both the evolutionary 
tree and the dependence relationship between characters when we drop the i.i.d. assumption. 
Of course, allowing arbitrary dependence between characters can lead to problems where there is 
insufficient information to reconstruct the tree. We need to carefully choose a dependence model 
that is tractable, yet realistic. In the literature no general procedures with provable properties have 
been proposed for detecting dependence. There are only simple, heuristic statistical approaches. 
(See, for example, [14] and the references cited therein.) In this paper we take the first steps along 
these lines: we suggest a simple model of preferred state dependence. In this model, two dependent 
characters are biased towards evolving to certain pairs of states in comparison to how they would 
behave if they were independent. We assume this bias to be uniform, and irrespective of the initial 
states of these characters. Under certain technical conditions we show that it is possible to infer 
the (pairwise) dependence structure among various characters.

Informal Statement of Results. We make the following technical assumption on the transition
matrix M of the various Markov processes: the norm $\|M\| = \max_{0 \ne x \perp \mathbf{1}} \|x^T M\|_1 / \|x\|_1$ is upper
bounded by $\lambda < 1$. The assumption implies that the corresponding Markov chain is rapidly mixing.

As stated above, the general stochastic model of evolution assumes a different transition matrix 
for every edge and character. If two characters are independent, then the joint transition
matrix governing their evolution is the tensor product of the individual transition matrices. If two
characters are dependent, then our model assumes that the joint transition matrix on any edge is
significantly "far away" from the tensor product. Furthermore, we assume that this correlation is
uniform across all edges, and history independent, that is, it is the same irrespective of the initial state
of the two characters. Formally, the difference between the two joint transition matrices, dependent
and independent, is rank one. See Section 2 for precise definitions.

The main tool we develop in this paper is a concentration bound for tree Markov random fields. 
Let Z be the random variable counting the number of occurrences of a character in a particular 
subset of states at the leaves of a rooted tree. We show that as long as $\|M_e\| \le \lambda < 1$ for each
edge e's transition matrix, and the tree is trivalent (internal nodes have degree 3), the variance of 
Z is sub-quadratic in the number of leaves. For instance, this implies by Chebyshev's inequality, 
that Z is concentrated around its mean if the latter is of the order of the number of leaves. 

In the literature in various fields (computational biology p3], communications [6], statistical 
physics |lHH9j). concentration bounds have been derived for a similar random variable: for 2 state 
characters with states encoded as {-1, +1} (instead of {0, 1} as we do), and each edge having a
symmetric transition matrix (that is, the CFN model), the random variable is the sum of the leaf
states. It is known [15] that there is a threshold mutation probability that decides whether the
ratio of variance to squared mean is bounded or not. Note that we do not observe such a phase 
transition; this is not surprising since our random variable is shifted additively. Nonetheless, our 
concentration result is exactly what we need for our application. Furthermore, our result holds 
for any number of states, and possibly will have other applications besides this work. We use the 
above result to get the following results on detecting dependence. 

• For multistate characters, we show that if the joint transition matrix of two independent 
characters has bounded norm (< 1/2), then there exists an algorithm to detect dependence. 

• For the special case of the two state CFN model, we need the much milder assumption of
$\|M_{e,i}\| < 1$ for detecting dependence.

Arguably, the assumption made in our second result above is biologically feasible and indeed has 
been made explicitly or implicitly in existing literature [21 El SI [S] ■ We believe the first assumption is 
a technicality, and conjecture that the analysis can be made to go through making only the weaker 
assumption of $\|M_{e,i}\| < 1$. This requires understanding a Markov chain that changes its transition
matrix at each step, in particular, the relation between the stationary vector of this chain and the 
stationary vectors of the different transition matrices at each step. 

2 Preliminaries 

Tree Markov Random Fields. In a (rooted) tree Markov random field, we have a rooted tree T with 
root r. We assume in this paper that the tree is trivalent, that is, every internal node has degree 
3 and the root has degree 2. This suffices for the application at hand, although our result holds
more generally. Each node $v \in V(T)$ has an associated random variable $X_v$ taking values
in a state space $S$ of size $|S| = s$. Each edge $e \in E(T)$ has an associated $s \times s$ transition matrix
$M_e$. In the ensuing stochastic process, the root random variable $X_r$ is associated with an arbitrary
distribution over $S$. For every edge $e = (u, v)$ with $u$ as parent of $v$, we have

$$P[X_v = b \mid X_u = a] = M_e(a, b)$$

for all $a, b \in S$. Given a matrix $M$, define the following norm

$$\|M\| := \max_{0 \ne x \perp \mathbf{1}} \frac{\|x^T M\|_1}{\|x\|_1}$$

Observe that the norm is the maximum variational distance between the distributions induced by 
taking one step of the corresponding Markov chain, starting from two distinct distributions over 
the state space. The maximizing $x$ can be thought of as $p - q$ for two probability distributions
$p, q$ with disjoint support, implying $\|x\|_1 = 2$; the observation follows since the numerator is twice
the variational distance. Henceforth, we assume that all transition matrices in this paper have
$\|M\| \le \lambda < 1$. In other words, all the corresponding Markov chains are rapidly mixing.
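The norm above can be evaluated directly for small matrices. The following is a minimal sketch, not part of the original analysis: it relies on the observation that the ratio is maximized at a difference of two point masses, so the norm equals the largest total variation distance between two rows. The helper names `cfn_matrix`, `matmul` and `contraction_norm` are hypothetical.

```python
from itertools import combinations


def cfn_matrix(p):
    """Symmetric two-state transition matrix with mutation probability p."""
    return [[1.0 - p, p], [p, 1.0 - p]]


def matmul(A, B):
    """Plain matrix product of two square lists-of-lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def contraction_norm(M):
    """Largest total variation distance between two rows of M.

    This equals max ||x^T M||_1 / ||x||_1 over nonzero x orthogonal to the
    all-ones vector, since that convex ratio is maximized at x = e_a - e_b.
    """
    best = 0.0
    for a, b in combinations(range(len(M)), 2):
        l1 = sum(abs(ra - rb) for ra, rb in zip(M[a], M[b]))
        best = max(best, l1 / 2.0)
    return best


# Two CFN edges; for a CFN matrix the norm is |1 - 2p|.
norm_1 = contraction_norm(cfn_matrix(0.1))
norm_2 = contraction_norm(cfn_matrix(0.25))
norm_12 = contraction_norm(matmul(cfn_matrix(0.1), cfn_matrix(0.25)))
```

For CFN matrices the norm of the product equals the product of the norms, which is the extreme case of the submultiplicativity used later in the paper.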

Given a subset of states $A \subseteq S$, let $Z_A$, or simply $Z$, denote the number of leaves of the tree T
whose state belongs to A. In Section 3 we upper bound the variance of this random variable.

Theorem 1. Given an n-leaf, trivalent tree T, and a subset $A \subseteq S$, $\mathbf{V}(Z_A) = O(n^{2 - 2\log_2(1/\lambda)})$.

Phylogenetic Trees. For maximum likelihood methods, the process of evolution is typically modeled 
as a tree Markov random field. The n leaves of the tree denote extant species and the interior vertices 
represent speciation events. The root is the common ancestor of the n species. 

Each node of the tree is associated with an (aligned) sequence of k characters denoted as [k],
which evolve down the tree. Each character takes a value from a certain set of states S such as
{0, 1}, {A, C, G, T}, {20 amino acids}, etc. The size of S is assumed to be an arbitrary but finite
constant. The random variable $X_i(v) \in S$ is used to denote the state of character i at vertex
v. In general, each character i and edge e = (u, v) of the tree has an associated $s \times s$ transition
matrix $M_{e,i}$, where $\forall a, b \in S$, $M_{e,i}(a, b) = P[X_i(v) = b \mid X_i(u) = a]$. Specific models assume that
the edge transition matrices are drawn from specific families of matrices. A simple, popular model
is the Cavender-Farris-Neyman (CFN) model where the characters are binary (with states 0 and
1) and the transition matrix is symmetric, with equal probabilities for state changes (mutations)
from 0 to 1 and 1 to 0. That is, $M_e[0, 1] = M_e[1, 0] = p_e$. Thus, in this model, transition matrices
are described by a single scalar parameter per edge.

As mentioned in the introduction, most of the literature assumes that every character evolves 
in an i.i.d fashion. In this work, we are interested in detecting dependencies between pairs of 
characters, if any. We next define our model of dependence. 

Dependence Model. Fix a pair of characters $i, j \in [k]$. Let X be the 'compound character' with
X(u) on vertex u being the ordered pair $(X_i(u), X_j(u))$. Thus, X is defined on a state set of size
$s^2$. If $X_i$ and $X_j$ are independent, then observe that X also evolves as a Markov random field,
with the $s^2 \times s^2$ transition matrix at edge e defined as $M_e := M_{e,i} \otimes M_{e,j}$. More precisely, the $s^2$
states are thought of as ordered pairs of states, and $(M_e)_{(a,b),(a',b')} := M_{e,i}(a, a') \cdot M_{e,j}(b, b')$.



1 For brevity, henceforth, we will alternately say i and j are (in)dependent, to mean $X_i$ and $X_j$ are (in)dependent.
In our model of dependence, we assume that for two dependent characters, one can still define
an $s^2 \times s^2$ matrix $M_e$ for every edge e such that the evolution of the compound character can
be described as a tree Markov random field with parameters $M_e$. Furthermore, if i and j are
dependent, we assume there is significant, uniform correlation which is history independent. To
make this precise, consider any edge e = (u, v) in the tree and fix a pair of states $(a, b) \in S \times S$ for
$(X_i(u), X_j(u))$. Note that $M_e$ induces a probability distribution over $S \times S$ at vertex v. Call this
$P^{(a,b)}_{\mathrm{dep}}(v)$. If i and j were independent, then call the probability distribution at v, $P^{(a,b)}_{\mathrm{ind}}(v)$. The
'drift' caused by dependence is precisely the vector $\boldsymbol{\delta}_e^{(a,b)} := P^{(a,b)}_{\mathrm{dep}}(v) - P^{(a,b)}_{\mathrm{ind}}(v)$.

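We assume that if i and j are dependent, then there exists a constant $\delta > 0$, such that:

$\forall (a, b) \in S \times S$, $\|\boldsymbol{\delta}_e^{(a,b)}\|_1 \ge 2\delta$ (Significant Correlation)

$\forall e, e' \in E$, $(a, b) \in S \times S$, $\boldsymbol{\delta}_e^{(a,b)} = \boldsymbol{\delta}_{e'}^{(a,b)}$ (Uniform Dependence)

$\forall (a, b), (a', b') \in S \times S$, $\boldsymbol{\delta}_e^{(a,b)} = \boldsymbol{\delta}_e^{(a',b')}$ (History Independence)

In other words, there is a vector $\boldsymbol{\delta} \in \mathbb{R}^{S \times S}$, with $\|\boldsymbol{\delta}\|_1 \ge 2\delta$ and $\sum_i \boldsymbol{\delta}(i) = 0$, such that
we have

$$M_e - M_{e,i} \otimes M_{e,j} = \mathbf{1}\boldsymbol{\delta}^T \quad \text{for each edge } e \text{, and dependent characters}$$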
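To make the dependence model concrete, here is a minimal sketch that builds the independent joint matrix as a tensor product and then adds the same drift vector to every row. The numerical values and the helper names `cfn_matrix`, `kron` and `add_drift` are illustrative assumptions, not taken from the paper.

```python
def cfn_matrix(p):
    """Symmetric two-state transition matrix with mutation probability p."""
    return [[1.0 - p, p], [p, 1.0 - p]]


def kron(A, B):
    """Tensor product; the pair (a, b) is stored at index a * len(B) + b."""
    n, m = len(A), len(B)
    return [[A[a][a2] * B[b][b2] for a2 in range(n) for b2 in range(m)]
            for a in range(n) for b in range(m)]


def add_drift(N, delta):
    """Return N + 1 delta^T, i.e. shift every row of N by the vector delta."""
    if abs(sum(delta)) > 1e-12:
        raise ValueError("drift must sum to zero to keep rows stochastic")
    M = [[x + d for x, d in zip(row, delta)] for row in N]
    if any(x < 0.0 for row in M for x in row):
        raise ValueError("drift pushes a transition probability below zero")
    return M


# Independent pair of CFN characters with mutation probabilities 0.1 and 0.2.
N = kron(cfn_matrix(0.1), cfn_matrix(0.2))
# Hypothetical drift favouring the joint state 00 (index 0).
delta = [0.03, -0.01, -0.01, -0.01]
M = add_drift(N, delta)
```

Because the drift is the same in every row, `M` minus `N` has rank one, which is the history independence property stated above.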

Given the dependence model above, an algorithm detects dependence if given any two characters 
i and j it can detect whether they are independent or dependent as per our model, with high 
probability (whp). By whp, we mean that the failure probability should be an inverse polynomial 
of the number of leaves. For general multistate characters, we give an algorithm for detecting 
dependence if the norm of each joint transition matrix is upper bounded away from 1/2. 

Theorem 2. If $\|M_{e,i} \otimes M_{e,j}\| < 1/2$ for all edges, then there exists an algorithm to detect
dependence for multiple finite state characters.

Special case: CFN Model. Recall in the CFN model that the transition matrix of each character
is $\begin{pmatrix} 1 - p & p \\ p & 1 - p \end{pmatrix}$, where $p := p_e$ is the probability of mutation on edge e. For
dependent characters i and j, we assume there exists for each edge a 4-dimensional vector $\boldsymbol{\delta}$ such
that $\|\boldsymbol{\delta}\|_1 \ge \delta$ (significant correlation) and $M_e = M_{e,i} \otimes M_{e,j} + \mathbf{1}\boldsymbol{\delta}^T$ (history independence). Note
that since both $M_e$ and $M_{e,i} \otimes M_{e,j}$ are probability matrices, we have $\sum_{i=1}^{4} \boldsymbol{\delta}(i) = 0$. For this model,
we can detect dependence with a much weaker assumption.

Theorem 3. In the two state CFN model, there exists an algorithm to detect dependence between
any two characters as long as $\|M_{e,i}\| < 1$ for all edges.

Comments on Assumptions. If there is no significant correlation between two dependent characters, 
then one cannot expect to detect dependence. Furthermore, 'low dependence' may not cause 
problems for algorithms that assume complete independence. The uniform dependence assumption 
can be relaxed to consistent dependence: if on a certain edge two dependent characters give a higher 
probability to the joint state (a, b) than they would have if they were independent, then they should 
prefer it, although maybe not to the same degree, on every edge. However, to keep the calculations 
clean we assume uniformity. Consistency seems to be a biologically reasonable assumption. It is
also infeasible to detect dependence with inconsistent characters with data just from the leaves. 
History independence is based on a memoryless principle for bias. This is a stronger (biological) 
assumption than the other two, however, it seems essential for this work, and we leave it as a 
challenge to relax or remove this. 






3 Bounding the Variance 



In this section, we prove Theorem 1. Fix any subset $A \subseteq S$. Recall that Z is the random variable
counting the number of leaves of T in a state in A. More generally, we define $Z(T_u)$ to be the
number of leaves in the sub-tree of T rooted at u in a state in A. Recall $\lambda < 1$ is an upper bound
on the norm of any of the transition matrices.

Theorem 4. (Restatement of Theorem 1) Given any n-leaf trivalent tree T, $\mathbf{V}(Z) = O(n^{2 - 2\log_2(1/\lambda)})$.

Proof. For any vertex v in V(T), let $L_v$ denote the leaves in the subtree rooted at v, and let $Z_v$
denote $Z(T_v)$. Fix a leaf $\ell$ and let $e_1, e_2, \ldots, e_t$ be the edges on the path from the root to $\ell$. Let M
denote the matrix $M_{e_1} \cdot M_{e_2} \cdots M_{e_t}$; this is the transition matrix from root to leaf $\ell$. In particular,
M is row-stochastic (row entries add up to 1). We state a couple of facts about row stochastic
matrices. The lemma below was suggested to us by Nikhil Srivastava [20], and a similar lemma can
be found as Lemma 4.12 in [13].



Lemma 1. For any two row stochastic matrices $M_1$ and $M_2$, we have $\|M_1 M_2\| \le \|M_1\| \cdot \|M_2\|$.

Proof. Note that given any $x \perp \mathbf{1}$, and any row stochastic matrix M, the vector $y = x^T M$ also
satisfies $y \perp \mathbf{1}$. This is because $M\mathbf{1} = \mathbf{1}$. If x is the vector achieving the maximum $\|x^T (M_1 M_2)\|_1 / \|x\|_1$,
then we get that

$$\|M_1 M_2\| = \frac{\|x^T M_1 M_2\|_1}{\|x\|_1} = \frac{\|x^T M_1 M_2\|_1}{\|x^T M_1\|_1} \cdot \frac{\|x^T M_1\|_1}{\|x\|_1} \le \|M_2\| \cdot \|M_1\|.$$

□ 

Lemma 1 implies that $\|M\|$ is at most $\lambda^t$. In particular, this shows for a leaf $\ell$ at a distance t from
a root vertex r, and for any states $b, c \in S$, we have

$$|P[X_\ell \in A \mid X_r = b] - P[X_\ell \in A \mid X_r = c]| \le \|x^T M\|_1 \le 2\lambda^t \qquad (1)$$

where x above is the vector with $x_b = 1$, $x_c = -1$, and $x_s = 0$ otherwise.

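Given a vertex v and two states $b, c \in S$, let $\Delta_v(b, c) := |\mathbf{E}(Z_v \mid X_v = b) - \mathbf{E}(Z_v \mid X_v = c)|$. Expanding,

$$\Delta_v(b, c) = \Big| \sum_{\ell \in L_v} \big( P[X_\ell \in A \mid X_v = b] - P[X_\ell \in A \mid X_v = c] \big) \Big| \le \sum_{\ell \in L_v} \big| P[X_\ell \in A \mid X_v = b] - P[X_\ell \in A \mid X_v = c] \big|$$

Using (1), we have $|P[X_\ell \in A \mid X_v = b] - P[X_\ell \in A \mid X_v = c]| \le 2\lambda^{d(v,\ell)}$, implying

$$\Delta_v(b, c) \le \Delta(v) := 2 \sum_{\ell \in L_v} \lambda^{d(v,\ell)}, \quad \forall v, b, c \qquad (2)$$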

We now bound the variance of Z as a function of the $\Delta(v)$'s.
Lemma 2. $\mathbf{V}(Z) \le \frac{1}{2} \sum_{v \in V(T) \setminus r} \Delta^2(v)$.
Proof. Recall that $T_u$ denotes the subtree of T rooted at u and $Z_u = \sum_{\ell \in L(T_u)} \mathbf{1}_{X_\ell \in A}$. We now
show using induction on the height that $\mathbf{V}(Z_u \mid X_u = b) \le \frac{1}{2} \sum_{v \in V(T_u) \setminus u} \Delta^2(v)$ for any vertex u, and
any $b \in S$. Note that the claim is vacuously true when u is a leaf.

Let u have children $v_1, \ldots, v_k$ (if the tree is binary, k = 2, but this lemma holds for any tree).
Assume we have proved the inductive claim for the $v_i$'s. Note that conditioned on $X_u$, the random
variables $Z_{v_1}, Z_{v_2}, \ldots$ are independent, since they are on different subtrees. Therefore, for any $b \in S$,

$$\mathbf{V}(Z_u \mid X_u = b) = \sum_{i=1}^{k} \mathbf{V}(Z_{v_i} \mid X_u = b).$$

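We now show that for any parent-child pair e = (u, v) and any state $b \in S$, we have

$$\mathbf{V}(Z_v \mid X_u = b) = \sum_{s \in S} P_{bs} \mathbf{V}(Z_v \mid X_v = s) + \frac{1}{2} \sum_{s \ne s' \in S} P_{bs} P_{bs'} \Delta_v^2(s, s') \qquad (3)$$

where $P_{bs} = P[X_v = s \mid X_u = b]$.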

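We introduce some notational shorthand just to keep the calculation below simple. We forgo
the subscript on $Z_v$, let $V := \mathbf{V}(Z \mid X_u = b)$, use "u = s" to imply $X_u = s$, and use $\mathbf{E}^2[Z]$ to denote
$(\mathbf{E}[Z])^2$. Now, by definition, $V = \mathbf{E}[Z^2 \mid u = b] - \mathbf{E}^2[Z \mid u = b]$. The first term evaluates to

$$\mathbf{E}[Z^2 \mid u = b] = \sum_{s \in S} P_{bs} \mathbf{E}[Z^2 \mid v = s]$$

The second term evaluates to

$$\mathbf{E}^2[Z \mid u = b] = \Big( \sum_{s \in S} P_{bs} \mathbf{E}[Z \mid v = s] \Big)^2 = \sum_{s \in S} P_{bs}^2 \mathbf{E}^2[Z \mid v = s] + \sum_{s \ne s' \in S} P_{bs} P_{bs'} \mathbf{E}[Z \mid v = s] \mathbf{E}[Z \mid v = s']$$

Observing $P_{bs}^2 = P_{bs} - P_{bs}(1 - P_{bs})$, we get

$$V = \sum_{s \in S} P_{bs} \big( \mathbf{E}[Z^2 \mid v = s] - \mathbf{E}^2[Z \mid v = s] \big) + \sum_{s \in S} P_{bs}(1 - P_{bs}) \mathbf{E}^2[Z \mid v = s] - \sum_{s \ne s' \in S} P_{bs} P_{bs'} \mathbf{E}[Z \mid v = s] \mathbf{E}[Z \mid v = s']$$

The first term above is the first term in the RHS of (3). Furthermore, noting that $P_{bs}(1 - P_{bs}) =
\sum_{s' \ne s} P_{bs} P_{bs'}$ since the $P_{bs'}$'s sum up to 1, we get that the last two terms equal

$$\frac{1}{2} \sum_{s \ne s' \in S} P_{bs} P_{bs'} \big( \mathbf{E}^2[Z \mid v = s] + \mathbf{E}^2[Z \mid v = s'] - 2 \mathbf{E}[Z \mid v = s] \mathbf{E}[Z \mid v = s'] \big) = \frac{1}{2} \sum_{s \ne s' \in S} P_{bs} P_{bs'} \Delta_v^2(s, s')$$

which establishes (3). By induction, the first term in the RHS is at most $\frac{1}{2} \sum_{x \in V(T_v) \setminus v} \Delta^2(x)$. Since
$\Delta_v(s, s') \le \Delta(v)$, we get that the second term is at most $\Delta^2(v) \sum_{s \ne s' \in S} P_{bs} P_{bs'} / 2 \le \frac{1}{2} \Delta^2(v)$. The
last inequality follows from the fact $\sum_{s \ne s' \in S} P_{bs} P_{bs'} \le (\sum_{s \in S} P_{bs})^2 = 1$.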

□ 
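The decomposition (3) also gives an exact way to compute the conditional variance bottom-up. The sketch below does this for a complete binary tree in which every edge carries the same matrix; that restriction, the CFN example values, and the names `cfn_matrix` and `leaf_count_moments` are simplifying assumptions made only for illustration.

```python
def cfn_matrix(p):
    """Symmetric two-state transition matrix with mutation probability p."""
    return [[1.0 - p, p], [p, 1.0 - p]]


def leaf_count_moments(M, A, depth):
    """Exact conditional mean and variance of Z on a complete binary tree.

    Returns (mean, var) with mean[s] = E[Z | X_root = s] and
    var[s] = V[Z | X_root = s], where Z counts leaves whose state is in A
    and every edge uses the same row-stochastic matrix M.
    """
    s = len(M)
    mean = [1.0 if a in A else 0.0 for a in range(s)]  # a single leaf
    var = [0.0] * s
    for _ in range(depth):
        new_mean, new_var = [], []
        for b in range(s):
            P = M[b]
            child_mean = sum(P[t] * mean[t] for t in range(s))
            # First term of (3): average of the child's conditional variances.
            within = sum(P[t] * var[t] for t in range(s))
            # Second term of (3): ordered pairs t != t2, hence the factor 1/2.
            between = 0.5 * sum(P[t] * P[t2] * (mean[t] - mean[t2]) ** 2
                                for t in range(s) for t2 in range(s) if t != t2)
            # Two children, independent given the parent's state.
            new_mean.append(2.0 * child_mean)
            new_var.append(2.0 * (within + between))
        mean, var = new_mean, new_var
    return mean, var


mean_1, var_1 = leaf_count_moments(cfn_matrix(0.1), {1}, 1)
mean_2, var_2 = leaf_count_moments(cfn_matrix(0.1), {1}, 2)
```

At depth 1 the two leaves are independent Bernoulli(0.1) variables given root state 0, so the variance is 2 · 0.1 · 0.9, which the recursion reproduces.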

We now bound $\sum_{v \in V(T) \setminus r} \Delta^2(v)$ for any n-leaf trivalent tree. Recall, $\Delta(v) = 2 \sum_{\ell \in L_v} \lambda^{d(v,\ell)}$. The
following claim bounds $\Delta(u)$ for any u in a binary tree.

Claim 1. For any vertex u with n leaves in its subtree, $\Delta(u) = O(n^{1-\eta})$ where $\eta = \log_2(1/\lambda)$.
Proof. Note that $\Delta(u) = 2 \sum_{i \ge 1} |L_i| \lambda^i$ where $L_i$ is the set of leaves at a distance i from u. Since
$\lambda < 1$, an n-leaf tree which maximizes $\Delta(u)$ will make the tree as balanced in height as possible.
(This can be proved by a "swapping" argument similar to the proof of optimality of Huffman trees.)
In particular, the maximizing tree has all leaves at distance $\lfloor \log n \rfloor$ or $\lfloor \log n \rfloor + 1$. Therefore,
$\Delta(u) \le \frac{2}{\lambda} \cdot n \lambda^{\log n} = \frac{2}{\lambda} \cdot n^{1 - \log(1/\lambda)}$. □
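As a small numerical illustration of the quantity in Claim 1, the sketch below evaluates $\Delta(u)$ from a list of leaf depths and compares a complete binary tree with a caterpillar tree. The value $\lambda = 0.8$ is an arbitrary choice above 1/2, and the helper names are hypothetical.

```python
import math


def delta_of(leaf_depths, lam):
    """Delta(u) = 2 * sum over leaves of lam ** (distance from u)."""
    return 2.0 * sum(lam ** d for d in leaf_depths)


def balanced_depths(k):
    """Leaf depths of a complete binary tree with 2**k leaves."""
    return [k] * (2 ** k)


def caterpillar_depths(n):
    """Leaf depths of a caterpillar tree with n >= 2 leaves."""
    return list(range(1, n - 1)) + [n - 1, n - 1]


lam = 0.8
n = 8
eta = math.log2(1.0 / lam)
balanced = delta_of(balanced_depths(3), lam)       # 2 * n * lam ** log2(n)
caterpillar = delta_of(caterpillar_depths(n), lam)
closed_form = 2.0 * n ** (1.0 - eta)               # 2 * n ** (1 - eta)
```

For a complete tree the value matches $2 n^{1-\eta}$ exactly, since $n \lambda^{\log_2 n} = n^{1-\eta}$; with this choice of $\lambda$ the caterpillar tree gives a smaller value.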

Lemma 3. $\mathbf{V}(Z) = O(n^{2 - 2\log_2(1/\lambda)})$.

Proof. Let V(n) be a function that denotes the maximum value of $\sum_{v \in V(T) \setminus r} \Delta^2(v)$ over all n-leaf
binary trees. Now given an n-leaf tree T, let u be the centroid of T. That is, $n/3 \le |L_u| \le 2n/3$.
It is easy to see this is well defined. Let $T_u$ denote the subtree of T rooted at u, and let $T'_u$ denote
the subtree of T with all descendants of u deleted. Note that both $T_u$ and $T'_u$ are binary trees, and
have $\rho n$ and $(1 - \rho) n$ leaves for $\rho \in [1/3, 2/3]$. By induction, we may assume

$$\sum_{v \in V(T_u) \setminus u} \Delta^2(v) \le V(\rho n), \quad \text{and} \quad \sum_{v \in V(T'_u) \setminus r} \Delta^2(v) \le V((1 - \rho) n)$$

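Suppose $u = u_0, u_1, \ldots, u_r = r$ is the unique path from u to r in T. Note that the $\Delta(v)$'s in tree
$T'_u$ are the same as in tree T for all vertices except the $u_i$'s. For each $u_i$, $\Delta(u_i)$ in the tree T is that
in $T'_u$ plus $\lambda^i \cdot \Delta(u)$. Thus, we have

$$\sum_{v \in V(T) \setminus r} \Delta^2(v) \le V(\rho n) + V((1 - \rho) n) + \sum_{i=0}^{r} \Big( \big( \Delta(u_i) + 2\lambda^i \Delta(u) \big)^2 - \Delta^2(u_i) \Big)$$

$$= V(\rho n) + V((1 - \rho) n) + 4\Delta(u) \sum_{i=0}^{r} \lambda^i \Delta(u_i) + 4\Delta^2(u) \sum_{i=0}^{r} \lambda^{2i}$$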

i=0 i=0 

From Claim 1 we can bound $\Delta(u_i)$ by $O(n^{1-\eta})$ for $i = 0, \ldots, r$. So we get the following recurrence
for V(n)

$$V(n) \le V(\rho n) + V((1 - \rho) n) + O(n^{2 - 2\eta})$$
which evaluates to $V(n) = O(n^{2 - 2\eta})$. □
Lemma 2 and 3 prove the theorem. □
Corollary 1. If $\mathbf{E}[Z] = \Omega(n^{1-\epsilon})$ for any $\epsilon > 0$, then $P[Z \in (1 \pm \rho)\mathbf{E}[Z]] \ge 1 -$ .
Proof. Follows from a direct application of Chebyshev's inequality. □ 
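The Chebyshev step can be sketched numerically as follows. The constants `var_const` and `mean_frac` are hypothetical stand-ins for the hidden constants in the variance bound of Theorem 1 and in a linear lower bound on the mean; the sketch only shows how the failure probability scales with n.

```python
import math


def chebyshev_failure_bound(variance, mean, rho):
    """Chebyshev upper bound on P[|Z - E[Z]| >= rho * E[Z]], capped at 1."""
    return min(1.0, variance / (rho * mean) ** 2)


def concentration_failure_bound(n, lam, rho, var_const=1.0, mean_frac=0.5):
    """Failure bound when V(Z) <= var_const * n**(2 - 2*eta) and
    E[Z] = mean_frac * n, with eta = log2(1 / lam)."""
    eta = math.log2(1.0 / lam)
    variance = var_const * n ** (2.0 - 2.0 * eta)
    mean = mean_frac * n
    return chebyshev_failure_bound(variance, mean, rho)


small_tree = concentration_failure_bound(100, 0.5, 0.1)
large_tree = concentration_failure_bound(10000, 0.5, 0.1)
```

Under these assumptions the bound decays like $n^{-2\eta}$ for a fixed relative error, which is the inverse polynomial failure probability used in the detection results.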



4 Detecting Dependence 

4.1 General Multistate Characters 

Fix characters i and j. For brevity's sake, let $N_e := M_{e,i} \otimes M_{e,j}$. Recall by assumption, we
have $\|N_e\| \le \lambda < 1/2$. Let X be the compound character taking values in $S \times S$ with X(u) :=
$(X_i(u), X_j(u))$. By our model of dependence, if i and j are dependent, then the compound character
evolves as a tree Markov random field process with transition matrices $M_e$ on each edge e, satisfying
the property $M_e - N_e = \mathbf{1}\boldsymbol{\delta}^T$ where $\|\boldsymbol{\delta}\|_1 = 2\delta > 0$ and $\sum_{s \in S \times S} \boldsymbol{\delta}(s) = 0$. Since $\|\boldsymbol{\delta}\|_1 = 2\delta$, we get
there exists a subset $S^* \subseteq S \times S$ such that $\sum_{s \in S^*} \boldsymbol{\delta}(s) = \delta$. That is, dependent characters put an
extra probability mass of $\delta$ on the states in $S^*$, as compared to independent characters.

Fix a leaf $\ell$ of the tree and let $(e_1, \ldots, e_t)$ be the path from root to $\ell$. Note that the matrices
$M := M_{e_1} M_{e_2} \cdots M_{e_t}$ and $N := N_{e_1} N_{e_2} \cdots N_{e_t}$ denote the transition probability matrices for the
compound character from root to leaf when the two characters are dependent and independent
respectively. Our first claim shows that if each $M_e$ puts significantly more mass on the states in $S^*$
than $N_e$, then so does M over N.

Lemma 4. Let $E := M - N$. Then E has all rows equal, and the entries on the columns
corresponding to $S^*$ sum to $\ge \lambda'\delta$, where $\lambda' = 1 - \frac{\lambda}{1 - \lambda}$.

Proof. Let's start with a definition.

Definition 1. A matrix is called an error matrix if all its rows are equal and sum to 0.

Note that the matrix $D := \mathbf{1}\boldsymbol{\delta}^T = M_{e_i} - N_{e_i}$ for each i, is an error matrix. The following are a few
easy to check properties of error matrices.

Claim 2. 1) The sum of two error matrices is an error matrix.

2) The product of two error matrices is the all zeros matrix.

3) The product of a row stochastic matrix with an error matrix is the error matrix itself.

4) The product of an error matrix with a row stochastic matrix is another error matrix.

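The above claim gives us,

$$M = (N_{e_1} + D)(N_{e_2} + D) \cdots (N_{e_t} + D) = N + D + D\,(N_{e_t} + N_{e_{t-1}} N_{e_t} + \cdots + N_{e_2} \cdots N_{e_t}).$$

Here every term containing two or more factors of D vanishes by properties 2 and 4, and a row
stochastic product in front of D is absorbed by property 3. This gives $E = D + DN^*$, where $N^*$ is
the sum of the matrices in the parentheses of the RHS above. Each summand of $N^*$ is row stochastic,
and furthermore, $\|N^*\| \le \lambda + \cdots + \lambda^{t-1} \le \frac{\lambda}{1 - \lambda}$. Here we are
implicitly using Lemma 1 and the fact that all norms are at most $\lambda$. Therefore, $DN^*$ is an error
matrix (and hence so is E) where each row has $\ell_1$ norm at most $\frac{\lambda}{1 - \lambda} 2\delta$. Thus, the coordinates of
$DN^*$ corresponding to $S^*$ sum to at least $-\frac{\lambda}{1 - \lambda}\delta$. This follows from the simple fact:

Fact 1. Given any vector $\boldsymbol{\delta} \in \mathbb{R}^n$ with $\sum_i \boldsymbol{\delta}(i) = 0$ and $\|\boldsymbol{\delta}\|_1 \le 2\epsilon$, we have $\max_{S \subseteq [n]} |\sum_{i \in S} \boldsymbol{\delta}(i)| \le \epsilon$.

Therefore, the sum of the columns of each row of E corresponding to $S^*$ is at least $\delta(1 - \frac{\lambda}{1 - \lambda})$. □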
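The error matrix properties are easy to check numerically. The sketch below does so on a path of two edges with arbitrary 3 x 3 row stochastic matrices; all numerical values and helper names are illustrative assumptions.

```python
def matmul(A, B):
    """Plain matrix product of two lists-of-lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def add(A, B):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]


def sub(A, B):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]


def error_matrix(delta):
    """The matrix 1 delta^T: every row equals delta."""
    return [list(delta) for _ in range(len(delta))]


def is_error_matrix(E, tol=1e-12):
    """All rows equal and each row sums to zero (up to tol)."""
    same_rows = all(abs(x - y) <= tol for row in E for x, y in zip(row, E[0]))
    return same_rows and abs(sum(E[0])) <= tol


# Two arbitrary row stochastic matrices (independent case on two edges).
N1 = [[0.7, 0.2, 0.1], [0.2, 0.6, 0.2], [0.1, 0.3, 0.6]]
N2 = [[0.5, 0.3, 0.2], [0.3, 0.4, 0.3], [0.2, 0.2, 0.6]]
D = error_matrix([0.02, -0.01, -0.01])

M = matmul(add(N1, D), add(N2, D))   # dependent root-to-leaf matrix
N = matmul(N1, N2)                   # independent root-to-leaf matrix
E = sub(M, N)
expected = add(D, matmul(D, N2))     # D + D N* for a path of two edges
```

On two edges the cross term `N1 D` collapses to `D` and `D D` vanishes, so `E` equals `D + D N2`, an error matrix.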

Note that if $\lambda < 1/2$, we get that $\lambda'$ is a constant bounded away from 0. Now we are ready
to state our algorithm for detecting dependence in the multistate character case. The idea is
pretty simple: if i and j are dependent, we expect a much more frequent occurrence of the states
in $S^*$ in $\{(X_i(\ell), X_j(\ell)) : \ell \in L(T)\}$, than what is expected if they were independent. One can
exhaustively search for this set $S^*$ since the number of states is a constant; the analysis follows
almost immediately as a corollary of Theorem 1.
Algorithm Multistate Dependence Detection.
Input: States of the characters at n leaves, i, j
Parameter: Precision parameter $\epsilon'$

1. Search for the special set $S^* \subseteq S \times S$ exhaustively by going over all $2^{s^2}$ subsets
(recall s is a constant).

2. Let $Z$ denote the fraction of leaves $\ell$ such that $(X_i(\ell), X_j(\ell)) \in S^*$.

3. Let $Z_a$ denote the fraction of leaves in state a, and let $\hat{Z} = \sum_{(a,b) \in S^*} Z_a Z_b$.

4. If $Z \ge (1 - \epsilon')(\hat{Z} + \lambda'\delta)$, declare i, j dependent. Else, declare i, j independent.
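The four steps above can be sketched as follows. This is one possible reading, not the authors' implementation: it assumes the exhaustive search means running the test for every proper nonempty candidate subset, that the per-state fractions are taken separately for character i and character j, and that the threshold $\lambda'\delta$ is supplied as a single number `gap`. All names are hypothetical.

```python
from itertools import combinations, product


def detect_dependence(pairs, states, gap, eps):
    """Return True if some candidate set S* passes the test of step 4.

    pairs  : list of (x_i, x_j) observed at the leaves
    states : iterable of the possible single-character states
    gap    : assumed value of lambda' * delta
    eps    : precision parameter
    """
    states = list(states)
    n = len(pairs)
    freq_i = {a: sum(1 for x, _ in pairs if x == a) / n for a in states}
    freq_j = {b: sum(1 for _, y in pairs if y == b) / n for b in states}
    joint = {ab: 0.0 for ab in product(states, repeat=2)}
    for ab in pairs:
        joint[ab] += 1.0 / n
    cells = list(joint)
    # The drift sums to zero, so S* is a proper nonempty subset of S x S.
    for size in range(1, len(cells)):
        for subset in combinations(cells, size):
            z = sum(joint[ab] for ab in subset)                      # step 2
            z_hat = sum(freq_i[a] * freq_j[b] for a, b in subset)    # step 3
            if z >= (1.0 - eps) * (z_hat + gap):                     # step 4
                return True
    return False


uniform_leaves = [(0, 0), (0, 1), (1, 0), (1, 1)] * 25
coupled_leaves = [(0, 0)] * 50 + [(1, 1)] * 50
```

With `gap = 0.1` and `eps = 0.05`, the uniform sample is reported as independent, while the perfectly coupled sample is reported as dependent because the set {(0, 0)} has observed mass 0.5 against a product estimate of 0.25.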

Theorem 5 (Implies Theorem 2). Algorithm Multistate Dependence Detection is correct.

Proof. (Sketch) Suppose i and j are dependent. We show our algorithm declares them to be
dependent whp. The proof when i and j are independent is similar. Note that if i and j were
independent, then $\mathbf{E}[\hat{Z}] = \sum_{(a,b) \in S^*} \mathbf{E}[Z_a]\mathbf{E}[Z_b]$, and thus by Lemma 4 we get $\mathbf{E}[Z] \ge \mathbf{E}[\hat{Z}] + \lambda'\delta$.
By Corollary 1 we have that $Z \ge (1 - \epsilon)(\mathbf{E}[\hat{Z}] + \lambda'\delta)$ whp.

Furthermore, note that whenever $\mathbf{E}[Z_a] = \Omega(1)$, Corollary 1 gives $\mathbf{E}[Z_a] \ge Z_a/(1 + \epsilon)$ whp.
This implies that even if $\mathbf{E}[Z_a]\mathbf{E}[Z_b] = \Omega(1)$ for any $(a, b) \in S^*$, we get whp that $\mathbf{E}[\hat{Z}] \ge
\frac{1}{(1 + \epsilon)^2} \sum_{(a,b) \in S^*} Z_a Z_b$. If all of the $\mathbf{E}[Z_a]\mathbf{E}[Z_b] = o(1)$, and so $\mathbf{E}[\hat{Z}] = o(1)$ (finitely many characters),
then Markov will imply that whp $\hat{Z} \le \epsilon\lambda'\delta$. In either case, we get whp $Z \ge \frac{1 - \epsilon}{(1 + \epsilon)^2}\hat{Z} + (1 - \epsilon)\lambda'\delta \ge
(1 - \epsilon')(\hat{Z} + \lambda'\delta)$ for the proper choice of $\epsilon'$. To make the proof precise one needs to make o(1)
precise; we avoid this in this abstract.

□ 

4.2 Symmetric, two state CFN Model 

For the case of two state, symmetric CFN model, we can analyze the counting algorithm similar 
to that of the previous subsection with a much weaker assumption. Note that the joint transition 
matrix in this case is a 4 x 4 matrix. The reason we are able to obtain a better analysis in the 
2-state symmetric model is because we can map any 4x4 'history independent' transition matrix 
onto a two state Markov chain. Two state Markov chains are much more well-behaved structures 
than those in higher states. We elucidate below. 

By symmetry, we can assume that the special state preferred by the dependent characters is
00. Consider the three partitions of the state space:

$\aleph = \{(00), (11)\}$; $\beth = \{(01), (10)\}$, or
$\aleph = \{(00), (01)\}$; $\beth = \{(10), (11)\}$, or
$\aleph = \{(00), (10)\}$; $\beth = \{(01), (11)\}$

Note that for one of the three choices, we have $|\sum_{s \in \aleph} \boldsymbol{\delta}(s)| \ge 2\delta/3$. By symmetry, suppose
$\aleph$ is such that the sum is positive. The reason for choosing the partitions is the following. We
claim that any 4x4 transition matrix of the form $M_e = M_{e,i} \otimes M_{e,j} + \mathbf{1}\boldsymbol{\delta}^T$ can be mapped onto a
transition matrix $M'_e$ on two states for any of the choices of $\aleph, \beth$ above. It suffices to check
for each choice above that the probability of going from either state in $\aleph$ to $\beth$ is the same. For
instance, suppose we are in the first partition. Then, we have

$$M[00, 01] + M[00, 10] = (1 - p)q + p(1 - q) + \boldsymbol{\delta}(01) + \boldsymbol{\delta}(10) = M[11, 01] + M[11, 10]$$
We are using crucially here the symmetry of the original 2x2 transition matrix and the history
independence. The other cases can be similarly checked. Also note that each probability is bounded
away from 0 and 1 by constants depending only on $\delta$ and $\lambda$. The following claim summarizes the
discussion above.

Claim 3. For any edge e and for any choice of $\aleph, \beth$ above, there exists a well-defined transition
matrix $M'_e$ on the two states $\{\aleph, \beth\}$ which is induced by the transition matrix $M_e$. Furthermore, there is a
choice of $\aleph, \beth$ such that

$$M'_e = \begin{pmatrix} 1 - p_e + \delta' & p_e - \delta' \\ p_e + \delta' & 1 - p_e - \delta' \end{pmatrix}$$

for some $p_e \ge C_\lambda$, and some $\delta' \ge 2\delta/3$.
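The lumping argument can be checked numerically. The sketch below builds a 4x4 joint matrix of the form tensor product plus rank-one drift and verifies that it collapses to a well-defined two-state chain for each of the three partitions. The mutation probabilities, the drift vector, and the helper names are illustrative assumptions.

```python
def cfn_matrix(p):
    """Symmetric two-state transition matrix with mutation probability p."""
    return [[1.0 - p, p], [p, 1.0 - p]]


def kron(A, B):
    """Tensor product; the pair (a, b) is stored at index a * len(B) + b."""
    n, m = len(A), len(B)
    return [[A[a][a2] * B[b][b2] for a2 in range(n) for b2 in range(m)]
            for a in range(n) for b in range(m)]


# Joint states are indexed 00 -> 0, 01 -> 1, 10 -> 2, 11 -> 3.
# Each entry is (first block, second block) of one of the three partitions.
PARTITIONS = [((0, 3), (1, 2)), ((0, 1), (2, 3)), ((0, 2), (1, 3))]


def lump(M, partition, tol=1e-12):
    """Collapse M onto the two blocks, or return None if not lumpable."""
    lumped = []
    for src in partition:
        row = []
        for dst in partition:
            # Mass sent into dst must not depend on the state chosen in src.
            masses = [sum(M[s][t] for t in dst) for s in src]
            if max(masses) - min(masses) > tol:
                return None
            row.append(masses[0])
        lumped.append(row)
    return lumped


N = kron(cfn_matrix(0.1), cfn_matrix(0.2))
delta = [0.03, -0.01, -0.01, -0.01]           # hypothetical drift towards 00
M = [[x + d for x, d in zip(row, delta)] for row in N]
lumped_chains = [lump(M, part) for part in PARTITIONS]
```

For the first partition the lumped chain is [[0.76, 0.24], [0.28, 0.72]], which has the form of Claim 3 with $p_e = 0.26$ and an offset of 0.02.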

We are now ready to state the algorithm. 



Algorithm CFN Dependence Detection.
Input: States of the characters at n leaves.

Parameter: Constant $C_{\lambda,\delta}$, defined later.

For the three choices of $\aleph$ and $\beth$ defined above, evaluate the fraction of leaves whose ith
and jth characters are in one of the states in $\aleph$. If for any one of them, this fraction is
larger than $\frac{1}{2} + 0.1\,C_{\lambda,\delta}$, declare dependent. Else, declare independent.
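A minimal sketch of this counting test follows. The argument `margin` stands in for the quantity $0.1\,C_{\lambda,\delta}$, whose value is not computed here, and the function and variable names are hypothetical.

```python
# First block of each of the three partitions of {00, 01, 10, 11}.
ALEPH_CHOICES = [
    {(0, 0), (1, 1)},
    {(0, 0), (0, 1)},
    {(0, 0), (1, 0)},
]


def cfn_detect(pairs, margin):
    """Declare dependence if, for some partition, the fraction of leaves
    whose joint state lies in the first block exceeds 1/2 + margin."""
    n = len(pairs)
    for aleph in ALEPH_CHOICES:
        fraction = sum(1 for xy in pairs if xy in aleph) / n
        if fraction > 0.5 + margin:
            return True
    return False


uniform_leaves = [(0, 0), (0, 1), (1, 0), (1, 1)] * 25
coupled_leaves = [(0, 0)] * 50 + [(1, 1)] * 50
```

On the uniform sample every block has fraction exactly 1/2, so nothing is flagged; on the coupled sample the first partition has fraction 1 and dependence is declared.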



Theorem 6 (Implies Theorem 3). Algorithm CFN Dependence Detection is correct.

Proof. (Sketch) Let Z denote the number of leaves in a state in $\aleph$. If characters i and j are independent,
then the transition matrix on any edge e is the tensor product of two symmetric matrices and hence
symmetric. Therefore the stationary vector is the all quarters vector. Since most leaves are at a
distance $\ge \log n$ from the root, and the chain is rapidly mixing ($\lambda < 1$), we get $\mathbf{E}[Z] \in (1/2 \pm o(1))n$.
By Corollary 1, whp $Z \le (1/2 + \epsilon')n$ for any $\epsilon' > 0$. Thus, whp independent characters are returned
as independent.

Suppose i and j are dependent, and suppose $(\aleph, \beth)$ is the partition which achieves the transition
matrix as in Claim 3. Fix a leaf $\ell$ and let $(e_1, \ldots, e_t)$ be the path from root to $\ell$; let $M' = M'_{e_1} \cdots M'_{e_t}$.
Note that if t is large enough, $M'(\aleph, \aleph) \approx M'(\beth, \aleph)$. It suffices to show that this quantity is bounded
away from 1/2 by a quantity $C'_{\lambda,\delta} > 0$. This implies $\mathbf{E}[Z] \ge (1/2 + C'_{\lambda,\delta})n$, and the proof follows
via Corollary 1.

Given a 2 x 2 row stochastic matrix $M = \begin{pmatrix} 1 - p & p \\ q & 1 - q \end{pmatrix}$ with $q > p$, define the asymmetry
$\rho(M) = q/p$.

Fact 2. Given two row stochastic matrices $M_1$ and $M_2$, $\rho(M_1 M_2)$ lies between $\rho(M_1)$ and $\rho(M_2)$.
Proof. If the asymmetry factor of $M_i$ is $q_i/p_i$, then that of $M_1 M_2$ is $\frac{q_1 + q_2 - q_1(p_2 + q_2)}{p_1 + p_2 - p_1(p_2 + q_2)}$, which
lies between $q_1/p_1$ and $q_2/p_2$. □
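The formula in the proof can be verified on a concrete pair of matrices. The sketch below uses arbitrary example values with $p + q \le 1$; the helper names are hypothetical.

```python
def two_state(p, q):
    """Row stochastic matrix [[1 - p, p], [q, 1 - q]]."""
    return [[1.0 - p, p], [q, 1.0 - q]]


def matmul(A, B):
    """Product of two 2 x 2 matrices given as lists-of-lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def asymmetry(M):
    """rho(M) = q / p, read off the off-diagonal entries."""
    return M[1][0] / M[0][1]


p1, q1 = 0.2, 0.3
p2, q2 = 0.1, 0.3
M1, M2 = two_state(p1, q1), two_state(p2, q2)

rho_1 = asymmetry(M1)
rho_2 = asymmetry(M2)
rho_product = asymmetry(matmul(M1, M2))
# Closed form from the proof of Fact 2.
rho_formula = (q1 + q2 - q1 * (p2 + q2)) / (p1 + p2 - p1 * (p2 + q2))
```

Here the product has off-diagonal entries 0.22 and 0.48, so its asymmetry of about 2.18 sits between 1.5 and 3.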

From the above fact, we get that $\rho(M') \ge \min_i \rho(M'_{e_i})$. Since each $M'_{e_i}$ has asymmetry bounded
away from 1 by Claim 3, we get that $\rho(M')$ is bounded away from 1. This implies that $M'(\beth, \aleph)$ is bounded away from
1/2 by a constant $C'_{\lambda,\delta}$ only depending on $\lambda$ and $\delta$. □

Acknowledgements. We would like to thank Elchanan Mossel and Nikhil Srivastava for useful 
discussions. 